Introduction

In our cta-seq experiment we want to sequence the same set of cells twice at vastly different sequencing depths. To do so we first sequenced an aliquote of a sequencing library containing reads from roughly 20,000 cells normally at 400 million reads. This script reads in the results of this first run, and then selects 18 cells for target amplification according to a set of criteria.

We will then PCR-amplify reads from those cells in the second aliquote by designing cell-barcode specific primers, and sequence this library again.

First run

In the first sequencing run of our cta-seq project, we re-sequenced the cDNA from the original experiment in “Amplification of human interneuron progenitors promotes brain tumors and neurological defects”.

First, we shortly analyse the resulting data with a standard pipeline to check if everything looks as expected.

sc_data <- Read10X(data.dir = '../cta_seq/outs/filtered_feature_bc_matrix/')

write.csv2(sc_data, file = '../data/cta_seq_typical.csv')

seurat_object <- CreateSeuratObject(sc_data, project = 'cta_seq_data', min.cells = 100, min.features = 0)

QC

First we visualize some basic QC metrics:

seurat_object <- AddMetaData(seurat_object, col.name = 'percent.mt', metadata = PercentageFeatureSet(seurat_object, pattern = "^MT-"))
VlnPlot(seurat_object, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

plot1 <- FeatureScatter(seurat_object, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(seurat_object, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

Based on these plots we set the QC cutoffs as nFeature_RNA > 1000 & nFeature_RNA < 5000 & percent.mt < 8. After this filter the QC Violinplots show these distributions:

seurat_object <- subset(seurat_object, subset = nFeature_RNA > 1000 & nFeature_RNA < 5000 & percent.mt < 8)

VlnPlot(seurat_object, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

Preprocessing

Next we perform default preprocessing steps such as Log-Normalization, Feature-Selection and Scaling. We then performed PCA and use the first 50 PCs for all following analysis steps. Based on the PCs we calculate the UMAP coordinates.

seurat_object <- NormalizeData(seurat_object, normalization.method = "LogNormalize", scale.factor = 10000)
seurat_object <- FindVariableFeatures(seurat_object, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(seurat_object)
seurat_object <- ScaleData(seurat_object, features = all.genes)
seurat_object <- RunPCA(seurat_object, features = VariableFeatures(object = seurat_object))
seurat_object <- RunUMAP(seurat_object, dims = 1:50)

Clustering

We now use the Louvain algorithm to perform clustering in 50 dim PC space at a resolution of 0.1. This resolution is set very low to get only 4 clusters, the same as in the original paper. At this low resolution cluster identities are pretty obvious, but we still want to confirm them using the same marker genes as shown in the supplement of the paper.

seurat_object <- FindNeighbors(seurat_object, dims = 1:50)
seurat_object <- FindClusters(seurat_object, resolution = 0.1, verbose = FALSE)


umap_plot <- DimPlot(seurat_object, reduction = "umap") + scale_y_reverse()
umap_plot

ggsave(paste0(folder,'umap_plot.png'), umap_plot, width = 6.6, height=4)

Here we visualize the QC features again in UMAP space:

umap_data <- tibble(UMAP_1 = Embeddings(seurat_object, reduction = 'umap')[,1],
                    UMAP_2 = Embeddings(seurat_object, reduction = 'umap')[,2],
                    nCount_RNA = seurat_object$nCount_RNA,
                    nFeature_RNA = seurat_object$nFeature_RNA,
                    percent.mt = seurat_object$percent.mt, 
                    barcode = colnames(seurat_object),
                    seurat_clusters = seurat_object$seurat_clusters)

tmp_plot_1 <- ggplot(umap_data) + geom_point(aes(x = UMAP_1, y = UMAP_2, col = nCount_RNA, nFeature_RNA = nFeature_RNA, mt = percent.mt, seurat_clusters = seurat_clusters), cex=.3) + scale_color_viridis() + scale_y_reverse() + theme(legend.position = 'bottom')
tmp_plot_2 <- ggplot(umap_data) + geom_point(aes(x = UMAP_1, y = UMAP_2, nCount_RNA = nCount_RNA, col = nFeature_RNA, mt = percent.mt, seurat_clusters = seurat_clusters), cex=.3) + scale_color_viridis() + scale_y_reverse() + theme(legend.position = 'bottom')
tmp_plot_3 <- ggplot(umap_data) + geom_point(aes(x = UMAP_1, y = UMAP_2, nCount_RNA = nCount_RNA, nFeature_RNA = nFeature_RNA, col = nCount_RNA/nFeature_RNA, mt = percent.mt, seurat_clusters = seurat_clusters), cex=.3) + scale_color_viridis() + scale_y_reverse() + theme(legend.position = 'bottom')

ggarrange(tmp_plot_1, tmp_plot_2, tmp_plot_3)

For cell-type identification we can use a DotPlot showing the expression levels of various known marker genes. The DotPlot below shows high similarity to the original Figure S5B (Ignoring the visual style):

marker_genes <- c('NEUROD6', 'NEUROD2', 'BCL11A', 'HES1', 'TTYH1', 'SLC1A3', 'TYMS', 'MKI67', 'TOP2A', 'DLX6-AS1', 'DLX5', 'SCGN')
DotPlot(seurat_object, features = marker_genes, scale = FALSE, dot.scale = 20) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_color_viridis(trans = "log2")

We can also visualize the expression of the same genes in UMAP-space:

FeaturePlot_but_good(seurat_object, features = marker_genes, point_size = .3, slot = 'data') + scale_y_reverse()

Cell selection

As for which cells to amplify; We defined a set of criteria:

To implement this we first run k-nearest neighbor clustering on the data-set and kick out neighborhoods that span two clusters. Next we rank genes by the number of reads, and the count to feature ratio CF to get the nCount_rank and the CF_rank. \[\begin{equation} CF = \frac{\text{Number of Reads}}{\text{Number of expressed Genes}} \end{equation}\] We use CF rather than the number of expressed genes because we are especially interested in cells with many reads but few expressed genes, and the other way around.

We then iterate over all cells and look for the most diverse neighborhood per cluster.

Here each cell is defined by it’s nCount_rank and it’s CF_rank. A neighborhood of 3 cells therefore induces a triangle in nCount_rank X CF_rank space. We define the most diverse neighborhood as the one for which the area of this triangle is maximal.

We can then delete these cells from the table and repeat the procedure to get the second most diverse neighborhoods as well. This way we select 18 cells.

We chose to do this on the ranks, rather than on the actual values to make nCounts and CF comparable.

The selected cells are shown below:

# Define neíghbourhoods
seurat_object <- FindNeighbors(seurat_object, dims = 1:50, return.neighbor = T, k.param = 3)
neighbourhood_info <- tibble(cell_1 = seurat_object@neighbors$RNA.nn@cell.names,
                             cell_2 = cell_1[seurat_object@neighbors$RNA.nn@nn.idx[,2]],
                             cell_3 = cell_1[seurat_object@neighbors$RNA.nn@nn.idx[,3]],
                             current_neighbours = cbind(cell_1, cell_2, cell_3))
# Score the neighbourhoods
neighbourhood_scores <- seurat_object %>%
  select(-c(orig.ident, PC_1:PC_50, umap_1, umap_2)) %>%
  group_by(seurat_clusters) %>%
  mutate(nFeature_Rank = rank(nFeature_RNA, ties.method = 'random'),
         nCount_Rank = rank(nCount_RNA, ties.method = 'random'),
         Count_Feature_ratio = nCount_RNA/nFeature_RNA,
         CF_Rank = rank(Count_Feature_ratio, ties.method = 'random')) %>%
  left_join(neighbourhood_info, by = c('.cell' = 'cell_1')) %>%
  group_by(.cell) %>%
  filter(seurat_clusters != 3)

neighbourhood_scores <- filter_split_neighbours(neighbourhood_scores)
neighbourhood_scores <- calc_dist_sum(neighbourhood_scores, dist_mode = 'triangle')
neighbourhood_scores <- neighbourhood_scores %>%
  arrange(desc(sum_of_dists)) %>%
  group_by(seurat_clusters) %>%
  mutate(is_top = NA) %>%
  mutate(is_top = replace(is_top, row_number() == 1, 1))

neighbour_info_to_join <- neighbourhood_scores %>%
  select(-c(cell_2, cell_3, current_neighbours, is_top, sum_of_dists))

# Select neighbourhoods
top_cells <- neighbourhood_scores %>%
  select(.cell, cell_2, cell_3, is_top, sum_of_dists, seurat_clusters) %>%
  filter(is_top == 1) %>%
  pivot_longer(cols = c(.cell, cell_2, cell_3), values_to = '.cell', names_to = NULL) %>%
  left_join(neighbour_info_to_join) %>%
  mutate(is_top = as_factor(is_top))  

neighbourhood_scores_filtered <- neighbourhood_scores %>%
  filter(!(.cell %in% top_cells$.cell | cell_2 %in% top_cells$.cell | cell_3 %in% top_cells$.cell)) %>%
  mutate(is_top = replace(is_top, row_number() == 1, 2))

top_cells_2 <- neighbourhood_scores_filtered %>%
  select(.cell, cell_2, cell_3, is_top, sum_of_dists, seurat_clusters) %>%
  filter(is_top == 2) %>%
  pivot_longer(cols = c(.cell, cell_2, cell_3), values_to = '.cell', names_to = NULL) %>%
  left_join(neighbour_info_to_join) %>%
  mutate(is_top = as_factor(is_top))  

top_1_2 <- rbind(top_cells, top_cells_2)

#Visualize neighbourhoods
umap_data <- tibble(UMAP_1 = Embeddings(seurat_object, reduction = 'umap')[,1], UMAP_2 = Embeddings(seurat_object, reduction = 'umap')[,2], .cell = colnames(seurat_object)) %>% left_join(neighbour_info_to_join) %>% left_join(top_1_2)

count_plot <- ggplot() + geom_point(data = filter(umap_data, is.na(is_top)), aes(x = UMAP_1, y = UMAP_2, nCount_Rank = nCount_Rank, nFeature_Rank = nFeature_Rank, CF_Rank = CF_Rank, .cell = .cell, mt = percent.mt, seurat_clusters = seurat_clusters), col = 'darkgrey') + geom_point(data = filter(umap_data, !is.na(is_top)), aes(x = UMAP_1, y = UMAP_2, nCount_Rank = nCount_Rank, nFeature_Rank = nFeature_Rank, CF_Rank = CF_Rank, col = is_top, .cell = .cell, mt = percent.mt, seurat_clusters = seurat_clusters)) + scale_y_reverse()

ggsave(paste0(folder,'selected_plot.png'), count_plot, width = 6.6, height = 4)

count_plot

tmp_tibble <- tibble(seurat_clusters = seurat_object$seurat_clusters, .cell = colnames(seurat_object)) %>% filter(seurat_clusters != 3) %>% right_join(neighbour_info_to_join) %>% left_join(top_1_2)
spread_plot <- ggplot() + geom_point(data=filter(tmp_tibble, is.na(is_top)), aes(x = nCount_Rank, y = CF_Rank, col = is_top), alpha=1) + geom_point(data=subset(tmp_tibble,!is.na(is_top)), aes(x = nCount_Rank, y = CF_Rank, col = is_top), alpha=1) + geom_abline(slope=1) + facet_wrap(~seurat_clusters)

ggsave(paste0(folder,'spread_plot.png'), spread_plot, width = 6.6, height = 4)

We can visualize the selected cells triplets in the CF_rank to nCount_Rank space in which they were selected:

ggplot() + geom_point(data=filter(tmp_tibble, is.na(is_top)), aes(x = nCount_Rank, y = CF_Rank), col = 'darkgrey', alpha=1) + geom_point(data=subset(tmp_tibble,!is.na(is_top)), aes(x = nCount_Rank, y = CF_Rank, col = is_top), alpha=1) + geom_abline(slope=1) + facet_wrap(~seurat_clusters)

ggplot() + geom_point(data=filter(tmp_tibble, is.na(is_top)), aes(x = nCount_RNA, y = Count_Feature_ratio), col = 'darkgrey', alpha=1) + geom_point(data=subset(tmp_tibble,!is.na(is_top)), aes(x = nCount_RNA, y = Count_Feature_ratio, col = is_top), alpha=1) + geom_abline(slope=1) + facet_wrap(~seurat_clusters)

ggplot() + geom_point(data=filter(tmp_tibble, is.na(is_top)), aes(x = nCount_RNA, y = nFeature_RNA), col = 'darkgrey', alpha=1) + geom_point(data=subset(tmp_tibble,!is.na(is_top)), aes(x = nCount_RNA, y = nFeature_RNA, col = is_top), alpha=1) + geom_abline(slope=1) + facet_wrap(~seurat_clusters)

The resulting cells and their characteristics are:

summary_data <- umap_data %>% filter(!is.na(is_top)) %>% select(c(.cell, seurat_clusters, is_top, nCount_RNA, nFeature_RNA, Count_Feature_ratio, percent.mt, sum_of_dists)) %>% arrange(seurat_clusters) %>% arrange(seurat_clusters, is_top)
kable(summary_data, format ='html') %>% kable_styling()
.cell seurat_clusters is_top nCount_RNA nFeature_RNA Count_Feature_ratio percent.mt sum_of_dists
ACGTCCTAGTGCCCGT-1 0 1 4993 1650 3.026061 4.9469257 8369712
GGCTTTCGTGTGTCGC-1 0 1 8965 2635 3.402277 1.6397100 8369712
GGGAGATTCTTAAGGC-1 0 1 2888 1515 1.906271 2.3891967 8369712
AATGACCGTGTGTGTT-1 0 2 4082 2247 1.816644 2.6702597 6911850
CATTGCCCAGAGACTG-1 0 2 12400 4035 3.073110 1.5322581 6911850
GTCTGTCTCGTAGGAG-1 0 2 7182 3258 2.204420 2.5201894 6911850
GAGAGGTAGGGAGGTG-1 1 1 5890 2026 2.907206 6.7741935 10658644
GCTTGGGAGGGTAATT-1 1 1 2098 1229 1.707079 6.2917064 10658644
TCATGTTTCCGCAACG-1 1 1 2429 1049 2.315539 5.3108275 10658644
GAGAGGTAGACGGTTG-1 1 2 1803 1144 1.576049 0.2218525 8719817
TAGACCAGTATTTCCT-1 1 2 6436 2570 2.504280 0.2019888 8719817
TCTTGCGTCCTACGGG-1 1 2 3116 1292 2.411765 1.0590501 8719817
GGGTATTAGAGGTTTA-1 2 1 6034 2092 2.884321 5.2038449 1671405
TATCGCCTCACATTGG-1 2 1 9646 2946 3.274270 4.1053286 1671405
TGTTCTATCGGTAGAG-1 2 1 4704 2044 2.301370 7.1641156 1671405
AGTTCCCCAGGCCCTA-1 2 2 13871 3745 3.703872 3.7272006 1652972
GAGTTACCAATAGGAT-1 2 2 5866 2176 2.695772 4.0402318 1652972
TCTCTGGGTGTGCTTA-1 2 2 6248 2570 2.431128 3.3610755 1652972
write_csv(summary_data, file = paste0(folder,'selected_cells.csv'))

Hamming Distances

To avoid any issues during PCR amplification we have to ensure that the primer sequences are not too similar.

We can calculate the Hamming Distances between the selected Cell-barcodes to ensure we can design sufficiently different primers and also show them in a Heatmap:

barcodes <- str_split_fixed( summary_data$`.cell`, pattern = '-', n = 2)[,1]

write_csv(tibble(cell = 1:18, barcodes = barcodes), file = 'select_cells/barcodes.csv')

hamming_dist <- function(x, y){
  require(stringr)
  x <- unlist(str_split(x, pattern = ''))
  y <- unlist(str_split(y, pattern = ''))
  tmp_dist <- sum(x != y)
  return(tmp_dist)
}


barcode_info <- expand(tibble(barcode_1 = barcodes, barcode_2 = barcodes), barcode_1, barcode_2) %>% mutate(hamming_distance = map2_dbl(barcode_1, barcode_2, hamming_dist))
hamming_distances <- barcode_info %>% filter(barcode_1 != barcode_2) 
summary(hamming_distances$hamming_distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   10.00   12.00   11.56   13.00   16.00
ggplot(barcode_info) + geom_tile(aes(x = barcode_1, y = barcode_2, fill = hamming_distance)) + scale_fill_viridis() + geom_text(aes(x = barcode_1, y = barcode_2, label = hamming_distance)) + theme(axis.text.x = element_text(angle=90))